Tutorial for R package htlgmm

Heterogeneous Transfer Learning Dealing with Unmatched Features

Datasets with information on all desired features typically have small sample sizes, while one or more external studies with a limited set of predictors often have large sample sizes. The HTL-GMM method (Heterogeneous Transfer Learning with Generalized Method of Moments) builds GLMs and Cox PH models with all desired features while incorporating the external studies.

This tutorial serves as an introduction to the HTL-GMM method. We introduce the usage of the R package together with some application examples. Please refer to https://arxiv.org/abs/2312.12786 for the HTL-GMM paper.

Installation

devtools::install_github("RuzhangZhao/htlgmm",force = T)

Method Recap

  • The main study and external study can come from the same or different populations.

  • The external study can be in the form of individual-level data or a trained model with parameter estimates.

  • \({\bf A}\) and \(\tilde{\bf A}\): study-specific adjustment factors; \({\bf Z}\): overlapping features in both main and external studies; \({\bf W}\): unmatched features only in main study; \({\bf Y}\): outcome variable (family = “gaussian”: continuous; family = “binomial”: binary)
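To make these data formats concrete, here is a minimal standalone sketch (toy data, not from the paper) of the two inputs HTL-GMM combines: individual-level main-study data (y, Z, W) and an external study reduced to coefficient summaries for Z.

```r
# Toy illustration of the two input formats (names follow the later examples).
set.seed(1)
n <- 100; nE <- 1000
Z <- matrix(rnorm(n*2), n, 2)             # overlapping features (main study)
W <- matrix(rnorm(n), n, 1)               # unmatched feature (main study only)
y <- rbinom(n, 1, 0.5)                    # binary outcome
ZE <- matrix(rnorm(nE*2), nE, 2)          # overlapping features (external study)
yE <- rbinom(nE, 1, 0.5)
fitE <- glm(yE ~ ZE, family = "binomial") # external model uses Z only
study_external <- list(Coeff = coef(fitE)[-1],
                       Covariance = vcov(fitE)[-1,-1],
                       Sample_size = nE)
str(study_external)
```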

Usage of Package htlgmm

Primary Functions of Package htlgmm

We primarily provide three functions: htlgmm, cv.htlgmm, and gwas.htlgmm. The first two work similarly to the functions glmnet and cv.glmnet from the R package glmnet. cv.glmnet/cv.htlgmm performs cross-validation with automatically generated tuning parameter list(s) to select the tuning parameters. We recommend cv.htlgmm for users who do not have pre-specified tuning parameters.

Simulation Examples with Binary Outcome

In this simulation example, we illustrate the usage of cv.htlgmm for binary outcome.

Simulation Setup

The simulation example illustrates how the cv.htlgmm method is implemented, compared with cv.glmnet. The simulation is performed under both linear and logistic regression settings. We use 20 overlapping features (Z) and 100 unmatched features (W): 3 of the 20 Z features have nonzero effects on the outcome, while 2 of the 100 W features have nonzero effects.

The main study sample size n is 500, while the external study sample size nE is 3000.

library(htlgmm,quietly = T)
pZ=20 # overlapping features
pW=100 # unmatched features 
coef<-c(rep(0,pZ+pW))
coef[1:3]<-0.5
coef[c(21,22)]<-0.5
n=500
nE=3000
n_joint=n+nE
main_index<-1:n
print(paste0("Nonzero positions: ",paste(which(coef!=0),collapse = ", ")))
## [1] "Nonzero positions: 1, 2, 3, 21, 22"

We jointly generate main and external studies.

y is generated with both Z and W.
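Here expit denotes the inverse-logit function exported by htlgmm; for reference, an equivalent standalone definition (expit_manual is our illustrative name, not a package function):

```r
# Inverse logit: maps a linear predictor to a probability in (0,1).
expit_manual <- function(x) 1/(1 + exp(-x))
expit_manual(0)  # 0.5
```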

set.seed(1)
Z_joint<-matrix(rnorm(n_joint*pZ),n_joint,pZ)
colnames(Z_joint)<-paste0("Z",1:pZ)
W_joint<-matrix(rnorm(n_joint*pW),n_joint,pW)
colnames(W_joint)<-paste0("W",1:pW)
Z<-Z_joint[main_index,]  # separate main and external study for Z
ZE<-Z_joint[-main_index,]

W<-W_joint[main_index,] # only need main study for W

y_joint<-rbinom(n=n_joint,size=1,prob=expit(cbind(Z_joint,W_joint)%*%coef))
y<-y_joint[main_index] # separate main study y
yE<-y_joint[-main_index] # separate external study y

Display the data structure of the main study (y, Z, W) with the first 2 rows and first 30 columns.

main_study<-cbind(y,Z,W)
head(main_study,2)[,1:30]
##      y         Z1        Z2         Z3         Z4        Z5         Z6
## [1,] 1 -0.6264538 0.5205997 -1.3254177  0.7051107 0.5232667 -2.4675028
## [2,] 0  0.1836433 0.3775619  0.9519797 -0.6268688 0.9935537 -0.7000656
##              Z7         Z8          Z9       Z10        Z11       Z12
## [1,]  0.6010915 -0.9252129  0.97027832 0.5375559  0.6986309 0.5574862
## [2,] -2.7671158  0.1805965 -0.02072336 0.2271813 -1.1650711 0.3520408
##             Z13       Z14         Z15       Z16       Z17        Z18      Z19
## [1,]  0.3707608 2.5285947 -0.06455497 -1.504228 1.2651240  0.7859485 1.380371
## [2,] -0.7265963 0.1232134 -0.58150321  1.093152 0.5123215 -0.4794245 1.682890
##            Z20          W1         W2       W3         W4         W5         W6
## [1,] 0.3768347 -1.00203611 -0.7132184 1.042499  0.9530678  1.1411060  0.3168121
## [2,] 1.2553529  0.02590761 -1.0166592 1.080826 -0.4157007 -0.6290421 -1.2707453
##             W7         W8          W9
## [1,] 0.7156816 -0.9463759  0.52378339
## [2,] 1.4276169 -1.4034296 -0.06035219

Display the data structure of the external study (yE, ZE).

external_study<-cbind(yE,ZE)
head(external_study,2)
##      yE          Z1         Z2         Z3        Z4         Z5        Z6
## [1,]  0  0.07730312 -1.1346302 -0.7948034 -1.411522 -0.5676023 0.9514099
## [2,]  1 -0.29686864  0.7645571  0.6060846  1.083870  0.4837350 0.4570987
##              Z7         Z8         Z9        Z10       Z11       Z12        Z13
## [1,] -0.5682999  0.1965621  0.5263087 -1.0905102 0.8084132 -1.680998 -0.4447731
## [2,]  0.1131262 -0.4199427 -1.1767110 -0.4629759 1.5886737  1.265549  0.4094111
##             Z14        Z15        Z16        Z17       Z18       Z19       Z20
## [1,] -1.6220979  0.5468718 -0.6363353 -1.6490749 0.3413341  1.564013 -1.024849
## [2,]  0.1897096 -0.1667157  0.7258137 -0.7714769 0.4136665 -0.795476 -1.249276

Scale Z and W predictors

Z<-scale(Z)
ZE<-scale(ZE)
W<-scale(W)

Train with Main Study by cv.glmnet

We run cv.glmnet for lasso on the main study only.

library(glmnet,quietly = T)
## Loaded glmnet 4.1-8
res_glmnet<-cv.glmnet(x=cbind(Z,W),y=y,family="binomial")

Train with External Study by glm

Summarize the external study as a trained model with parameter estimates. Here we fit a logistic regression model on the external study (yE, ZE). Extract the estimated coefficients (excluding the intercept term; Z only) and the variance-covariance matrix, then convert the results into a list.

resglm<-glm(y~.,data = data.frame(y=yE,ZE),family = "binomial")
study_external = list(
                Coeff=resglm$coefficients[-1],
                Covariance=vcov(resglm)[-1,-1],
                Sample_size=nE)
ext_study_info<-list()
ext_study_info[[1]] <- study_external

Train with Main and External Studies by cv.htlgmm

Key Parameter List of Function cv.htlgmm

  • y : Response Variable.

  • Z : Overlapping Features.

  • W : Unmatched Features.

  • A : Study-specific Adjustment Factors. Omit the input A if there is no intercept term (all variables are centered) when family = “gaussian”, or if there is only an intercept term when family = “binomial”.

  • ext_study_info : Parameter estimates from the external study for the overlapping features. “ext_study_info” comes as a list, including the estimated coefficients and the estimated variance-covariance matrix.

  • family : Currently supports linear (“gaussian”) and logistic (“binomial”).

  • penalty_type : Select from c(“none”,“adaptivelasso”,“lasso”,“ridge”,“elasticnet”). We allow all penalty choices available from glmnet.

Train with cv.htlgmm using family “binomial”. First train HTL-GMM using lasso. The tune_weight option introduces regularization of the weighting matrix.

res_htlgmm<-cv.htlgmm(y=y,
                      Z=Z,
                      W=W,
                      ext_study_info = ext_study_info,
                      family="binomial",
                      tune_weight = T,
                      penalty_type = "lasso",
                      beta_initial = as.vector(coef(res_glmnet,s="lambda.min")))

For HTL-GMM with adaptive lasso, we recommend deriving the adaptive weights from ridge regression coefficients. When using adaptive lasso, the inference outputs are generated by default.

res_ridge<-cv.glmnet(x=cbind(Z,W),y=y,alpha=0,family="binomial")
weight_adaptivelasso<-1/abs(c(coef(res_ridge,s="lambda.min")[-1]))^(1/2)
res_htlgmm_ada<-cv.htlgmm(y=y,
                      Z=Z,
                      W=W,
                      ext_study_info = ext_study_info,
                      family="binomial",
                      tune_weight = T,
                      penalty_type = "adaptivelasso",
                      weight_adaptivelasso = weight_adaptivelasso,
                      beta_initial = as.vector(coef(res_glmnet,s="lambda.min")))

Check the estimation error between estimated coefficients and true coefficients.

est_coef_htlgmm<-res_htlgmm$beta[-1]
ee_htlgmm<-sum((coef-est_coef_htlgmm)^2)
ee_htlgmm<-round(ee_htlgmm,4)

est_coef_lasso<-coef.glmnet(res_glmnet,s="lambda.min")[-1]
ee_lasso<-sum((coef-est_coef_lasso)^2)
ee_lasso<-round(ee_lasso,4)

cat(paste0("Estimation Error: ","lasso(",ee_lasso,");\n htlgmm(",ee_htlgmm,")"))
## Estimation Error: lasso(0.2046);
##  htlgmm(0.1903)

Check the out of sample prediction performance on test data.

n_test = 2000
set.seed(203)
Z_test<-matrix(rnorm(n_test*pZ),n_test,pZ)
colnames(Z_test)<-paste0("Z",1:pZ)
W_test<-matrix(rnorm(n_test*pW),n_test,pW)
colnames(W_test)<-paste0("W",1:pW)

y_test<-rbinom(n=n_test,size=1,prob=expit(cbind(Z_test,W_test)%*%coef))

library(pROC,quietly = T)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# intercepts of estimated coefficients are ignored 
# since intercepts do not affect AUC. 
pred_y_htlgmm<-cbind(Z_test,W_test)%*%est_coef_htlgmm
pred_y_lasso<-cbind(Z_test,W_test)%*%est_coef_lasso
  
suppressMessages(auc_htlgmm<-c(auc(y_test,c(htlgmm::expit(pred_y_htlgmm)),direction = "<")))
suppressMessages(auc_lasso<-c(auc(y_test,c(htlgmm::expit(pred_y_lasso)),direction = "<")))

cat(paste0("AUC on test data: ","lasso(",round(auc_lasso,4),");\n htlgmm(",round(auc_htlgmm,4),")"))
## AUC on test data: lasso(0.7355);
##  htlgmm(0.7423)

Output list of cv.htlgmm

  • beta : Coefficient estimates for all predictors, in the order (A, Z, W).

  • lambda_list: The lambda list for cross validation.

  • weight_list: The weight list for cross validation, when set tune_weight = TRUE.

  • cv_mse: The mean squared error (MSE) for cross validation when family = “gaussian”.

  • cv_auc: The area under the ROC curve (AUC) for cross validation when family = “binomial”. For each lambda, cv_auc is the average AUC across all folds.

  • cv_dev: The deviance for cross validation when family = “binomial”; an alternative to cv_auc.

  • lambda_min: The selected best lambda.

  • weight_min: The selected best weight, when set tune_weight = TRUE.

  • selected_vars: A list of results for predictors with nonzero estimated coefficients.

    • position: the position of nonzero predictors.

    • name: the name of nonzero predictors.

    • coef: the coefficient of nonzero predictors.

    • variance: the variance of nonzero predictors.

    • variance-covariance: the variance-covariance matrix of nonzero predictors.

    • pval: the p values of nonzero predictors.

Example of beta

est_coef_htlgmm<-res_htlgmm$beta
names(est_coef_htlgmm)<-c("Intercept",paste0('Z',1:pZ),paste0('W',1:pW))
est_coef_htlgmm[1:30] # only show the first 30 coefficients
##    Intercept           Z1           Z2           Z3           Z4           Z5 
## -0.002721408  0.409342577  0.383107817  0.329962010  0.000000000  0.000000000 
##           Z6           Z7           Z8           Z9          Z10          Z11 
##  0.000000000  0.000000000  0.000000000  0.000000000 -0.001099747  0.000000000 
##          Z12          Z13          Z14          Z15          Z16          Z17 
##  0.000000000 -0.004052489 -0.024355986  0.020497717 -0.031639490  0.000000000 
##          Z18          Z19          Z20           W1           W2           W3 
##  0.000000000  0.000000000  0.000000000  0.254542621  0.248371849  0.000000000 
##           W4           W5           W6           W7           W8           W9 
##  0.000000000  0.000000000  0.000000000  0.000000000  0.000000000  0.000000000

HTLGMM selected_vars

print(res_htlgmm_ada$selected_vars$coef)
## [1] -0.002824616  0.434463883  0.403189739  0.306472840  0.281940098
## [6]  0.281515701 -0.040083609
print(res_htlgmm_ada$selected_vars$variance)
## [1] 0.008062514 0.001946540 0.002115148 0.001953433 0.008557824 0.009278032
## [7] 0.009342087
print(res_htlgmm_ada$selected_vars$pval)
## [1] 9.749047e-01 7.033366e-23 1.838812e-18 4.086938e-12 2.305841e-03
## [6] 3.470823e-03 6.783540e-01

The positions passing significance level 0.05 after BH (FDR) adjustment.

print(paste0("The predictors whose effects on outcome are nonzero: ",paste(c(paste0('Z',1:pZ),paste0('W',1:pW))[coef!=0],collapse = ", ")))
## [1] "The predictors whose effects on outcome are nonzero: Z1, Z2, Z3, W1, W2"
print(paste0("The predictors selected by htlgmm after FDR adjustment :",paste(res_htlgmm_ada$selected_vars$FDR_adjust_name,collapse = ", ")))
## [1] "The predictors selected by htlgmm after FDR adjustment :Z1, Z2, Z3, W1, W2"
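As a standalone check (not package code), the BH adjustment can be reproduced with base R’s p.adjust applied to the p-values printed above; positions 2 to 6 (Z1, Z2, Z3, W1, W2) pass the 0.05 level:

```r
# p-values of the 7 selected predictors printed above (intercept is first).
pv <- c(9.749047e-01, 7.033366e-23, 1.838812e-18, 4.086938e-12,
        2.305841e-03, 3.470823e-03, 6.783540e-01)
pass <- which(p.adjust(pv, method = "BH") < 0.05)
print(pass)  # 2 3 4 5 6, i.e. Z1, Z2, Z3, W1, W2
```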

The lambda list and cross-validation AUC.

head(lambda_auc_df<-data.frame(lambda=res_htlgmm$lambda_list,
                  cv_auc=t(res_htlgmm$cv_auc)))
##      lambda  cv_auc.1  cv_auc.2  cv_auc.3  cv_auc.4  cv_auc.5  cv_auc.6
## 1 0.3525870 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000
## 2 0.3212641 0.6855111 0.6574222 0.6280889 0.5395556 0.5368889 0.5368889
## 3 0.2927239 0.7086222 0.6894222 0.6728889 0.6392889 0.6323556 0.6323556
## 4 0.2667191 0.7143111 0.7038222 0.6956444 0.6824889 0.6752000 0.6730667
## 5 0.2430245 0.7153778 0.7088000 0.7043556 0.6936889 0.6904889 0.6887111
## 6 0.2214349 0.7169778 0.7114667 0.7063111 0.6997333 0.6960000 0.6960000

The best selected lambda.

lambda_min<-res_htlgmm$lambda_min
print(lambda_min)
## [1] 0.0499783

Real Application Example for Risk Prediction

The real application is the 10-year risk of incident breast cancer.

The proteomics data are downloaded from https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/past-data-releases

The risk factors are downloaded from https://biobank.ndph.ox.ac.uk/showcase/browse.cgi

The most recent analysis comes from May 2023:

Gadd, Danni A., et al. “Blood protein levels predict leading incident diseases and mortality in UK Biobank.” medRxiv (2023): 2023-05.

Load the required packages, including htlgmm.

library(caret,quietly = T)
library(glmnet,quietly = T)
library(htlgmm,quietly = T)
library(pROC,quietly = T)
foldname = "incBreastCancer"

Simulated UKB data including disease status, risk factors, and proteomics

The risk factor names are listed below. We record the summarized mean and standard deviation for each risk factor, including whether each predictor is binary or continuous.

risk_factor_names<-c("age","BMI","Height","Smoking_Status","Alcohol_Freq",
                     "Mother_Breast_Cancer_History","Silbings_Breast_Cancer_History",
                     "Number_of_Birth","Age_Menarche","Ever_Used_Oral_Contraceptive",
                     "Ever_Received_Hormone_Replacement_Therapy",
                     "Ever_Received_Bilateral_Oophorectomy",
                     "Ever_Had_Still_Birth","Prevalent_Benign_Breast_Disease")

mean_risk_factor<-c(59.7636,27.1449,162.1822,0.3442,0.1804,0.0758,0.0409,1.8789,12.9352,0.8001,0.4922,0.0961,0.3094,0.0083)

sd_risk_factor<-c(5.4963,5.0753,6.1417,0.4751,0.3845,0.2647,0.1980,1.1317,1.6063,0.3999,0.4999,0.2948,0.4622,0.0907)

names(mean_risk_factor)<-risk_factor_names
print(mean_risk_factor)
##                                       age 
##                                   59.7636 
##                                       BMI 
##                                   27.1449 
##                                    Height 
##                                  162.1822 
##                            Smoking_Status 
##                                    0.3442 
##                              Alcohol_Freq 
##                                    0.1804 
##              Mother_Breast_Cancer_History 
##                                    0.0758 
##            Silbings_Breast_Cancer_History 
##                                    0.0409 
##                           Number_of_Birth 
##                                    1.8789 
##                              Age_Menarche 
##                                   12.9352 
##              Ever_Used_Oral_Contraceptive 
##                                    0.8001 
## Ever_Received_Hormone_Replacement_Therapy 
##                                    0.4922 
##      Ever_Received_Bilateral_Oophorectomy 
##                                    0.0961 
##                      Ever_Had_Still_Birth 
##                                    0.3094 
##           Prevalent_Benign_Breast_Disease 
##                                    0.0083

For simplicity, we assume all binary risk factors follow a binomial distribution and continuous risk factors follow a normal distribution. We perform a case-control analysis to allow fast computation in this simulation. The main study sample size is assumed to be 2,000, and the external study is 9 times larger, a ratio similar to the actual data.

There are in total 1,500 proteins in the UK Biobank proteomics data; we use 500 for simulation.

binary_predictors<-c("Mother_Breast_Cancer_History",
                     "Silbings_Breast_Cancer_History",
                     "Ever_Used_Oral_Contraceptive",
                     "Ever_Received_Hormone_Replacement_Therapy",
                     "Ever_Received_Bilateral_Oophorectomy",
                     "Ever_Had_Still_Birth",
                     "Prevalent_Benign_Breast_Disease")

n_main<-2000
n_external<-n_main*9
n_joint<-n_main+n_external
p_protein<-500
set.seed(2023)
risk_factor<-list()
for( i in which(risk_factor_names%in%binary_predictors)){
    risk_factor[[i]]<-rbinom(n = n_joint,size = 1,prob = mean_risk_factor[i])
}
for( i in which(!risk_factor_names%in%binary_predictors)){
    risk_factor[[i]]<-rnorm(n = n_joint,mean = mean_risk_factor[i],sd = sd_risk_factor[i])
}
risk_factor_matrix<-Reduce("cbind",risk_factor)

Check the data structure of the risk factor matrix, which is similar to the actual data structure.

colnames(risk_factor_matrix)<-risk_factor_names
round(head(risk_factor_matrix,2),2)
##        age   BMI Height Smoking_Status Alcohol_Freq
## [1,] 64.52 31.98 160.26          -0.23         0.61
## [2,] 44.70 27.05 168.29           0.38         0.20
##      Mother_Breast_Cancer_History Silbings_Breast_Cancer_History
## [1,]                            0                              0
## [2,]                            0                              0
##      Number_of_Birth Age_Menarche Ever_Used_Oral_Contraceptive
## [1,]            2.62        14.42                            0
## [2,]            3.12        13.53                            1
##      Ever_Received_Hormone_Replacement_Therapy
## [1,]                                         0
## [2,]                                         1
##      Ever_Received_Bilateral_Oophorectomy Ever_Had_Still_Birth
## [1,]                                    0                    0
## [2,]                                    0                    0
##      Prevalent_Benign_Breast_Disease
## [1,]                               0
## [2,]                               0

The proteins are assumed to follow N(0,1).

proteomics_matrix<-matrix(rnorm(n_joint*p_protein),nrow = n_joint,ncol = p_protein)
eg_proteomics<-proteomics_matrix[1:6,1:5]
colnames(eg_proteomics)<-c("aarsd1","abhd14b","abl1","acaa1","ace2")
eg_proteomics
##          aarsd1     abhd14b       abl1      acaa1          ace2
## [1,] -1.1105242 -0.01414405 -1.0864208 -0.5041349  4.490670e-01
## [2,]  1.0969956  0.81478464 -0.3507938 -1.2923906 -1.026354e-01
## [3,]  0.1096429  1.17141003 -0.6785858  1.6844419 -6.698569e-05
## [4,] -0.4895755  0.09112312 -0.5685032 -0.6172262 -5.422962e-01
## [5,]  1.0021940  0.66385649 -1.4594543  0.6078855 -1.003801e+00
## [6,] -1.0121142 -0.64776973 -0.4683433 -1.1744974 -1.432753e+00

We assume the disease status follows a binomial distribution, with a case proportion of roughly 0.1.
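The intercept log(mean_case/(1 - mean_case)) used below sets the baseline probability on the logit scale: with all covariate effects at zero, the disease probability is exactly mean_case. A quick standalone check (the inverse logit is written out locally for self-containment):

```r
expit_local <- function(x) 1/(1 + exp(-x))  # same formula as htlgmm::expit
mean_case <- 0.1
baseline <- expit_local(log(mean_case/(1 - mean_case)))
print(baseline)  # 0.1
```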

coef_risk_factor<-c(0.1172,0.1216,0.1067,-0.0312,0.0816,0.1038,0.0755,-0.0507,-0.0099,-0.0099,0.0376,-0.0266,-0.0251,0.0628)
coef_proteomics<-rep(0,p_protein)
set.seed(2024)
coef_proteomics[sample(1:p_protein,20)]<-rnorm(20,mean=0,sd = 0.1)

mean_case<-0.1
Y<-rbinom(n_joint,size = 1,prob =expit(
    log(mean_case/(1-mean_case))+c(scale(risk_factor_matrix)%*%coef_risk_factor)+c(proteomics_matrix%*%coef_proteomics)))
print(paste0("simulated disease prevalence:",sum(Y)/n_joint))
## [1] "simulated disease prevalence:0.11255"

Separate the simulated data into main study and external study.

main_study<-data.frame(Y = Y[1:n_main],risk_factor_matrix[1:n_main,],proteomics_matrix[1:n_main,])

external_study<-data.frame(Y = Y[-c(1:n_main)],risk_factor_matrix[-c(1:n_main),])

Scale the predictors from the external study, then extract its summary statistics.

external_study[,-1]=scale(external_study[,-1])
family = "binomial"
ext_study_info=list()
external_glm=glm(Y~.,data = external_study,family = family)
study_external = list(
                Coeff=external_glm$coefficients[-1],
                Covariance=vcov(external_glm)[-1,-1],
                Sample_size=nrow(external_study))
ext_study_info[[1]] = study_external
#ext_study_info

Split the main study into training and test sets (70% and 30%).

library(caret)
set.seed(2024)
test_index=createDataPartition(main_study$Y,p = 0.3)$Resample1

y and X are the training set for the main study, while y_test and X_test are the test set. We normalize X and apply the same normalization to X_test.

y = main_study$Y[-test_index]
X = as.matrix(main_study[-test_index,-1])
y_test = main_study$Y[test_index]
X_test = as.matrix(main_study[test_index,-1])
prefunc <- preProcess(X, method = c("center", "scale"))
X = predict(prefunc, X)
X_test = as.matrix(predict(prefunc, X_test))        
p_risk = ncol(external_study)-1

Train with Main Study only by cv.glmnet

Perform cv.glmnet only for main study’s training set and check the performance on main study’s test set. The criterion is AUC.

start_t=Sys.time()
main_lasso=cv.glmnet(x=X, y=y, family="binomial")
end_t=Sys.time()
sprintf('cv.glmnet time: %.2f %s', end_t-start_t, units(end_t-start_t))
## [1] "cv.glmnet time: 17.70 secs"
pred_main_lasso=c(predict(main_lasso,newx=X_test,s="lambda.min"))
suppressMessages(auc_main_lasso<-auc(y_test,expit(pred_main_lasso)))

Train with Main study and External Study by cv.htlgmm

Perform cv.htlgmm for main study’s training set with external study and check the performance on main study’s test set.

start_t=Sys.time()
main_external_htlgmm=htlgmm::cv.htlgmm(y = y,
                               Z = X[,c(1:p_risk)],
                               W = X[,-c(1:p_risk)],
                               ext_study_info = ext_study_info,
                               penalty_type = "lasso",
                               beta_initial = as.vector(coef(main_lasso,s="lambda.min")),
                               family = "binomial")
end_t=Sys.time()
sprintf('cv.htlgmm time: %.2f %s', end_t-start_t, units(end_t-start_t))
## [1] "cv.htlgmm time: 20.23 secs"

Perform cv.htlgmm with the option of weight tuning by setting tune_weight = TRUE.

start_t=Sys.time()
main_external_htlgmm_tune=htlgmm::cv.htlgmm(y = y,
                               Z = X[,c(1:p_risk)],
                               W = X[,-c(1:p_risk)],
                               ext_study_info = ext_study_info,
                               family = "binomial",
                               penalty_type = "lasso",
                               beta_initial = as.vector(coef(main_lasso,s="lambda.min")),
                               tune_weight = TRUE)
end_t=Sys.time()
sprintf('cv.htlgmm with weight tuning time: %.2f %s', end_t-start_t, units(end_t-start_t))
## [1] "cv.htlgmm with weight tuning time: 40.34 secs"

Performance Comparison

Check the AUC comparison between cv.glmnet and cv.htlgmm.

pred_main_external_htlgmm=c(X_test%*%main_external_htlgmm$beta[-1])
suppressMessages(auc_main_external_htlgmm<-pROC::auc(y_test,expit(pred_main_external_htlgmm)))

pred_main_external_htlgmm_tune=c(X_test%*%main_external_htlgmm_tune$beta[-1])
suppressMessages(auc_main_external_htlgmm_tune<-pROC::auc(y_test,expit(pred_main_external_htlgmm_tune)))

cat(paste0("AUC on test data: main study only lasso(",round(auc_main_lasso,4) ,");\n htlgmm(", round(auc_main_external_htlgmm,4),");\n htlgmm_tune_weight(",round(auc_main_external_htlgmm_tune,4),")"))
## AUC on test data: main study only lasso(0.544);
##  htlgmm(0.5803);
##  htlgmm_tune_weight(0.591)

Plot the cross-validation AUC against log(lambda), the tuning parameter for the lasso penalty.

df = data.frame(lambda = main_external_htlgmm$lambda_list,
                cv_dev=main_external_htlgmm$cv_dev,
                cv_auc=main_external_htlgmm$cv_auc)
library(ggplot2)
ggplot(df)+
  geom_line(aes(x=log(lambda),y=cv_auc))+
  geom_vline(xintercept = log(main_external_htlgmm$lambda_min),
             linetype=2)+
  theme(panel.background=element_rect(fill='transparent', color='black'),
        axis.text=element_text(size=14,face="bold"))+
  geom_text(aes(x=log(main_external_htlgmm$lambda_min),y=median(cv_auc),label='lambda with max cv auc'))


GWAS version of HTLGMM

In this example, a simulation is generated to illustrate the application of gwas.htlgmm.

We simulate SNP_matrix from binomial distributions, where the allele frequencies are drawn from a uniform distribution. We consider 100 SNPs, with a main study sample size of 10k and external summary statistics based on a sample size of 100k.

The adjustment covariates include age, sex, and BMI. Assume age and sex are adjusted for in the external summary statistics, while BMI is not. We can further adjust for BMI using both the main study and the external summary statistics via gwas.htlgmm.

Simulation Setup

Simulate SNP matrix and covariates

library(htlgmm)
snp_count<-100
n_ext<-1e5
n_main<-1e4
n_joint<-n_main+n_ext
set.seed(2024)
allele_freq<-runif(snp_count,min = 0.1,max = 0.9)
snp_matrix<-sapply(1:snp_count, function(i){
    rbinom(n=n_joint,size=2,prob=allele_freq[i])})

age<-rnorm(n_joint,mean = 60,sd = 5)
sex<-rbinom(n_joint,size=1,prob = 0.5)
BMI<-rnorm(n_joint,mean = 27,sd = 5)

Simulate binary outcome by randomly setting causal SNPs

true_snp_position<-sample(1:100,5)
coef_snp<-rep(0,snp_count)
coef_snp[true_snp_position]<-log(1.2)*sample(c(1,-1),5,replace=TRUE)
print(paste0("True Causal SNP Positions: ",paste(sort(true_snp_position),collapse = ", ")))
## [1] "True Causal SNP Positions: 7, 10, 19, 27, 98"
coef_risk_factor<-c(0.1172,-0.05,0.1216)
risk_factor_matrix<-cbind(age,sex,BMI)

t2d<-rbinom(n_joint,size = 1,prob = expit(
    log(0.1/(1-0.1))+c(scale(risk_factor_matrix)%*%coef_risk_factor)+c(snp_matrix%*%coef_snp)))
table(t2d)
## t2d
##     0     1 
## 98858 11142

Split simulated data into main study and external study

A consists of age and sex; W only includes BMI; Z includes one SNP at a time. In the general case, A and W can be lists of covariates. For the A input of gwas.htlgmm, a column of 1s is required for the intercept term.

A<-cbind(1,risk_factor_matrix[1:n_main,c(1,2)])
AE<-cbind(1,risk_factor_matrix[-c(1:n_main),c(1,2)])
W<-risk_factor_matrix[1:n_main,3,drop=F] 
y<-t2d[1:n_main]
yE<-t2d[-c(1:n_main)]
snp_matrix_main<-snp_matrix[1:n_main,]
snp_matrix_ext<-snp_matrix[-c(1:n_main),]

Simulate External GWAS Summary Statistics

The simulated external summary statistics ext_study_info correspond to the external summary statistics we usually obtain from a large consortium.

ext_study_info<-list()
for(i in 1:snp_count){
    ZE<-snp_matrix_ext[,i]
    resglm<-glm(y~.,data = data.frame(y=yE,ZE,AE[,-1]),family = "binomial")
    study_external = list(
        Coeff=resglm$coefficients[2],
        Covariance=vcov(resglm)[2,2])
    ext_study_info[[i]] <- study_external
}
print(ext_study_info[[1]])
## $Coeff
##           ZE 
## -0.008973933 
## 
## $Covariance
## [1] 0.0003068138

Requirement of gwas.htlgmm

gwas.htlgmm requires:

  • (1) : All covariates (A & W) except the intercept term should be centered to mean 0;

  • (2) : Each SNP variable also needs to be centered to mean 0;

  • (3) : If beta_initial is provided, it should correspond to the centered variables and follow the order (A, W, Z).

# center A and W except intercept term
A[,-1]<-scale(A[,-1],scale=F)
W<-scale(W,scale=F)
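As a standalone illustration of what scale(x, scale = FALSE) does above: it subtracts each column mean without rescaling, so the centered columns average to zero, as gwas.htlgmm requires.

```r
set.seed(1)
W_demo <- matrix(rnorm(50), 50, 1)
W_demo_centered <- scale(W_demo, scale = FALSE)  # subtract column means only
max(abs(colMeans(W_demo_centered)))              # numerically ~0
```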

Train with Main Study and External GWAS by gwas.htlgmm

Key Parameter List of Function gwas.htlgmm

  • y : Outcome of Interest. Binary or continuous.

  • Z : SNP.

  • W : Covariate of Interest, Unadjusted in External GWAS.

  • A : Study-specific Adjustment Factors. All covariates adjusted for in the external summary statistics. When family = “binomial”, a column of 1s should be added for the intercept term.

  • ext_study_info : External GWAS Summary Statistics. “ext_study_info” comes as a list, including the estimated coefficient and estimated variance for the SNP.

  • family : linear (“gaussian”) and logistic (“binomial”).

  • beta_initial : Initial coefficient estimates, given for the centered variables and in the order (A, W, Z).

  • repeated_term : Precomputed quantities that can be reused across SNPs.

  • output_SNP_only : Only output the SNP’s updated coefficient estimate and estimated variance; otherwise, information for all covariates is output.

  • output_tmp : Whether to output repeated_term.

Note that repeated_term should be computed in advance to accelerate the computation.

start_t=Sys.time()
library(speedglm,quietly = T)
# first compute the repeated terms. 
repeated_term <- gwas.htlgmm.repeated.term(y,A,"binomial")
beta_list<-c()
se_list<-c()
for(i in 1:snp_count){
    # center SNP
    Z<-snp_matrix_main[,i]
    Z<-scale(Z,scale=F)
    # beta initial 
    resglm<-speedglm(y~.,data = data.frame(y=y,A[,-1],W,Z),family=binomial())
    beta_initial<-resglm$coefficients
    # find external gwas
    GWAS_ext_study_info = list()
    GWAS_ext_study_info[[1]] = ext_study_info[[i]]
    res.gwas<-gwas.htlgmm(y,Z,W,GWAS_ext_study_info,A,"binomial",beta_initial,repeated_term=repeated_term)
    beta_list<-c(beta_list,res.gwas$beta)
    se_list<-c(se_list,sqrt(res.gwas$variance))
}
end_t=Sys.time()
sprintf('gwas.htlgmm time: %.2f %s', end_t-start_t, units(end_t-start_t))
## [1] "gwas.htlgmm time: 2.71 secs"

Verify the results

In our simulation, BMI acts as a generally strong risk factor; without adjusting for BMI, the external GWAS summary statistics do not lead to the correct conclusion.

pval_list_ext<-sapply(1:snp_count, function(i){
  pchisq(ext_study_info[[i]][[1]]^2/ext_study_info[[i]][[2]]^2,1,lower.tail = F)
})
pval_list_htlgmm<-pchisq(beta_list^2/se_list^2,1,lower.tail = F)
print(paste0("Number of significant SNPs identified by external GWAS: ",sum(pval_list_ext<0.05/100)))
## [1] "Number of significant SNPs identified by external GWAS: 93"
print(paste0("Significant SNP positions identified by gwas.htlgmm: ",paste(which(pval_list_htlgmm<0.05/100),collapse = ", ")))
## [1] "Significant SNP positions identified by gwas.htlgmm: 7, 10, 19, 27, 98"

Accelerate gwas.htlgmm by plink2

Tutorial of plink2 with example files

Check this link for an introduction to plink2: http://ruzhangzhao.github.io/tutorial/plink2

To be updated.

Fast Matrix Product & Inverse

In our package htlgmm, we provide fast implementations of matrix products and matrix inverses.
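The package internals are not reproduced here; as an illustrative base-R sketch (not htlgmm’s actual code) of the kind of fast linear algebra involved, crossprod(X) computes t(X) %*% X without materializing the transpose, and chol2inv inverts a symmetric positive-definite matrix via its Cholesky factor, typically faster than a general solve():

```r
set.seed(1)
X <- matrix(rnorm(2000*20), 2000, 20)
S <- crossprod(X)            # t(X) %*% X, computed efficiently
S_inv <- chol2inv(chol(S))   # inverse of an SPD matrix via Cholesky
max(abs(S_inv %*% S - diag(20)))  # numerically ~0
```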